Feature matching based on local windows aggregation

Summary
The core goal of feature matching is to establish correspondences between two images. Current detector-free methods achieve impressive results but often focus on global features, neglecting regions with subtle textures and producing fewer matches in weak-texture areas. This paper proposes a feature-matching method based on local window aggregation, which balances global features against local texture variations to obtain more accurate matches, especially in weak-texture regions. Our method first applies a local window aggregation module that uses window attention to suppress irrelevant interference, followed by global attention, generating coarse-grained and fine-grained feature maps. These maps are processed by a matching module that initially obtains coarse matches via the nearest-neighbor principle; the coarse matches are then refined on the fine-grained maps through local window refinement. Experimental results show that our method surpasses state-of-the-art techniques in pose estimation, homography estimation, and visual localization under the same training conditions.


INTRODUCTION
Feature matching is a fundamental task in computer vision, aiming to establish correspondences between features in pairs of images. It serves as a cornerstone for many three-dimensional vision tasks, including structure-from-motion (SfM), 1 3D reconstruction, 2 visual localization, 3,4 and pose estimation. 5,6 Feature matching also finds interdisciplinary applications. In the medical field, image-matching techniques are used to register medical images taken by different devices or at different times, allowing doctors to analyze images from various perspectives and improve diagnostic accuracy. For microscope images, feature matching is employed to detect and track changes in cells or subcellular structures, aiding research on cell behavior and biological processes. In environmental science, feature-matching techniques can compare glacier images taken at different times to analyze glacier advance or retreat and study the impact of climate change on glaciers; they can also detect and track cracks and fissures in ice caps to predict potential iceberg calving events. In the preservation of cultural heritage, feature-matching technology can match and align artifact fragments, assisting restorers in reconstructing damaged historical artifacts; it can also integrate different images of artifacts to create high-precision 3D digital models for preservation and display. 9,10,11 However, feature matching faces innumerable challenges, including variations in lighting conditions, scale changes, poor texture, and repetitive patterns, all of which significantly increase the difficulty of obtaining consistent and accurate matching results.
Detector-based methods 7,18 first utilize keypoint detectors to identify keypoints between images and then establish correspondences between these keypoints. The quality of the keypoints strongly influences matching performance, so many studies have focused on optimizing keypoint detection through multi-scale detection 19 and reliability verification 9 to improve matching performance while maintaining high computational and memory efficiency. However, these methods often struggle to find reliable matches in textureless regions where keypoints are difficult to detect. In contrast, detector-free matching methods do not require keypoints to be detected beforehand; instead, they directly attempt to establish pixel-level correspondences between features, enabling matching in textureless regions. In recent years, Transformer-based matching methods have gained widespread use due to their advantage in capturing long-range dependencies. 11,16,20,21 Representative works such as LoFTR 11 utilize a linear transformer 22 in the coarse matching stage to obtain and refine global features. COTR 23 iteratively computes shared visible regions through attention mechanisms to address scale variation. These Transformer-based methods demonstrate the effectiveness of attention mechanisms in feature matching. However, recent research 24,25 suggests that Transformers may lack spatial perception bias in continuous dense prediction tasks, which can lead to inconsistent matching results. Previous methods relied on global attention mechanisms, ignoring local feature information and performing poorly in weak-texture regions. Our method differs from earlier methods by focusing more on the local features of the image, as image features are inherently local in nature. At the same time, we do not want to lose global context, so we combine local window attention with global attention.

Related work
Feature matching can be broadly categorized into two main types: detector-based methods and detector-free methods. Detector-based feature matching typically consists of three main stages: feature detection, description, and matching. Manually designed feature detectors such as SIFT 26 and ORB 27 are well known. In recent years, learning-based methods 7,9,12,13,26,27,28 have demonstrated superior performance to traditional handcrafted methods. For instance, D2Net 13 combines the feature detection and description stages, while R2D2 9 trains a network to identify reliable and repeatable features. Additionally, SuperGlue 15 proposes an attention-based graph neural network (GNN) that optimizes extracted features through alternating updates of self-attention and cross-attention. However, detector-based methods rely on local feature extractors, which may limit performance in challenging scenarios such as repetitive textures, weak textures, and illumination variations. In contrast, detector-free methods do not rely on local feature detectors; instead, they directly find dense feature matches between pixels. This circumvents the limitations of traditional feature extraction, allowing more flexible and extensive matching in complex environments. Learning-based methods were first adopted in the literature, 29,30 where pixel-level feature descriptors were learned using a contrastive loss. As in detector-based methods, matching of dense descriptors is typically achieved through nearest-neighbor search. NCNet 10 adopts a different strategy, directly learning dense correspondences in an end-to-end manner: it constructs a 4D cost volume to enumerate all potential matches between the images and normalizes it through 4D convolution to ensure neighborhood consistency among matches. Sparse NCNet 18 improves upon NCNet 10 by introducing sparse convolution to enhance efficiency. Following this line of research, DRC-Net 8 uses CNN feature maps of two different resolutions to construct two 4D matching tensors, which are then fused to achieve high-confidence feature matching, and proposes a coarse-to-fine strategy to improve the accuracy of dense matching. While the 4D cost volume considers all possible matches, the receptive field of 4D convolution remains limited to the neighborhood of each match. In recent years, Transformer models 31 have garnered widespread attention in computer vision. In visual tasks such as image classification, 32,33,34 object detection, 35,36,37 and image segmentation, 38,39 Transformers leverage their global interaction capability to explore key regions in images. Due to their outstanding performance, Transformer techniques have also been applied to image feature matching. 11,15,21 Despite significant achievements, the original attention mechanism of Transformers incurs high computational costs when processing high-resolution images. Therefore, various approximation methods 20,22,40,41 have been proposed to reduce these costs, often at the expense of performance. For example, linear attention 22 approximates the softmax function using the ELU 42 function to reduce the computational complexity to linear, albeit weakening the model's focusing ability. ASpanFormer 16 introduces an adaptive attention-span selection method, which, while flexible, often overlooks the importance of local consistency. Additionally, Transformers may sometimes ignore local feature information in image tasks. 43 To overcome these limitations, we propose a novel strategy that avoids additional computational and memory overheads by aggregating features with local windows to maintain local consistency, effectively enhancing the accuracy and efficiency of feature matching.

Method
Our method follows the overall process outlined in Figure 2 to perform feature matching between two images. It consists of three main modules: the local windows aggregation module, the coarse matching module, and the fine matching module. Below, we briefly introduce the entire process. Given images $I_A$ and $I_B$, we first extract multi-scale feature maps for each image using the local windows aggregation module. We denote the feature map at scale $1/i$ as $F^{1/i} = \{F_A^{1/i}, F_B^{1/i}\}$. Next, we input $F^{1/8}$ into the coarse matching module for coarse-grained feature matching. We use the nearest-neighbor principle to obtain a confidence matrix $P_c$ and predict coarse-grained matches $M_c$ based on a confidence threshold. Finally, we input $F^{1/2}$, $F^{1/8}$, and the coarse-grained matches $M_c$ into the fine matching module. We upsample $F^{1/8}$ and fuse it with $F^{1/2}$ before performing fine-grained matching. We crop local windows from $F^{1/2}$ and compute the spatial expectation coordinates of the two-dimensional heatmap for each local window to obtain the final matching results $M_f$.
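To make the three-stage flow concrete, the following is a minimal PyTorch-style sketch of the pipeline; the module objects and their call signatures are our illustrative assumptions, not the released implementation.

```python
def match_pair(img_a, img_b, backbone, coarse_matcher, fine_matcher, theta_c=0.2):
    """Illustrative three-stage pipeline: LWA backbone -> coarse -> fine.

    `backbone`, `coarse_matcher`, and `fine_matcher` are hypothetical
    nn.Module-like objects following the description above.
    """
    # 1. Multi-scale features from the local windows aggregation module.
    feat_a_8, feat_a_2 = backbone(img_a)   # 1/8- and 1/2-resolution maps
    feat_b_8, feat_b_2 = backbone(img_b)

    # 2. Coarse matching on the 1/8 maps: confidence matrix P_c, threshold,
    #    and mutual-nearest-neighbor filtering produce M_c.
    p_c, coarse = coarse_matcher(feat_a_8, feat_b_8, threshold=theta_c)

    # 3. Fine matching: upsample/fuse the 1/8 maps into the 1/2 maps, then
    #    refine each coarse match to sub-pixel accuracy in a local window.
    return fine_matcher(feat_a_2, feat_b_2, feat_a_8, feat_b_8, coarse)
```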

Local windows aggregation
The local windows aggregation (LWA) module, shown in Figure 3, takes the input image $I \in \mathbb{R}^{2 \times H \times W \times 1}$ (where 2, $H$, $W$, and 1 denote the number of images, height, width, and number of channels) and applies four rounds of local window aggregation followed by feature pyramid network (FPN) processing to obtain $F^{1/2}$ and $F^{1/8}$. $C_0$, $C_1$, $C_2$, and $C_3$ denote the channel dimensions of the feature maps. Initially, the image is grayscale with a single channel. The first local window aggregation changes the feature dimension from 1 to $C_0$ and halves the height and width; the second changes it from $C_0$ to $C_1$, halving the spatial size again; the third changes it from $C_1$ to $C_2$; and the fourth changes it from $C_2$ to $C_3$, each halving the height and width once more. Finally, the four feature maps are fused through a feature pyramid, outputting two feature maps for subsequent coarse-grained and fine-grained feature matching. The dimensions of these two feature maps are $C_0$ and $C_2$, respectively. We denote the processing at stage $i$ as $\mathrm{LWA}_i(\cdot)$ and the FPN as $\mathrm{FPN}(\cdot)$. The local window aggregation processing is represented as:

$$I_i = \mathrm{LWA}_i(I_{i-1}), \quad i = 1, 2, 3, 4 \quad \text{(Equation 1)}$$

$$o = \mathrm{FPN}\big(\{I_i\}_{i=1}^{4}\big) \quad \text{(Equation 2)}$$

where $I_0$ denotes the input image. Finally, we obtain feature maps at 1/8 and 1/2 of the original image size. In local window aggregation, the attention mechanism plays a central role. Ordinary attention takes three inputs: $Q$ (query), $K$ (key), and $V$ (value). The attention output is a weighted sum, where the weight matrix is determined by $Q$ and its corresponding $K$. This process can be described as:

$$\mathrm{Attention}(Q, K, V) = \mathrm{SoftMax}(QK^T)V \quad \text{(Equation 3)}$$

However, in visual tasks the size of the weight matrix $\mathrm{SoftMax}(QK^T)$ grows quadratically with image resolution, so the memory and computational costs of ordinary attention become prohibitive at high resolutions. To address this issue, linear attention has been proposed, 22 which replaces the softmax operation with the product of two kernel functions:

$$\mathrm{Attention}(Q, K, V) = \phi(Q)\big(\phi(K)^T V\big), \quad \phi(\cdot) = \mathrm{elu}(\cdot) + 1 \quad \text{(Equation 4)}$$

Since the number of feature channels is much smaller than the number of pixels, the computational complexity drops from quadratic to linear. We therefore adopt linear attention for global attention.
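To make Equation 4 concrete, here is a minimal PyTorch sketch of linear attention with the kernel $\phi(x) = \mathrm{elu}(x) + 1$; the tensor shapes and the small epsilon added for numerical stability are our assumptions.

```python
import torch
import torch.nn.functional as F

def linear_attention(q, k, v, eps=1e-6):
    """Linear attention (ref. 22) with phi(x) = elu(x) + 1.

    q, k: (N, L, D); v: (N, L, M). Cost is O(L * D * M) instead of
    O(L^2 * D), because the D x M summary (K^T V) is computed first.
    """
    q = F.elu(q) + 1
    k = F.elu(k) + 1
    kv = torch.einsum("nld,nlm->ndm", k, v)                         # K^T V
    z = 1.0 / (torch.einsum("nld,nd->nl", q, k.sum(dim=1)) + eps)   # normalizer
    return torch.einsum("nld,ndm,nl->nlm", q, kv, z)
```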
The local attention mechanism is inspired by the Swin Transformer, 40 but unlike it, we do not use the shifted-window operation, replacing it instead with global attention. Additionally, we do not require the image height and width to be exact multiples of the window size; if they are not divisible, we pad the image accordingly. For an input of size $H \times W \times C$, we first reshape it into a feature map of size $\frac{HW}{M^2} \times M^2 \times C$, dividing it into non-overlapping $M \times M$ local windows, where $\frac{HW}{M^2}$ is the number of windows. Self-attention is then computed separately within each window. For the local window features $X \in \mathbb{R}^{M^2 \times C}$, the projections used to implement local attention are:

$$Q = XP_Q, \quad K = XP_K, \quad V = XP_V \quad \text{(Equation 5)}$$

where $P_Q$, $P_K$, $P_V$ are projection matrices shared across windows, and typically $Q, K, V \in \mathbb{R}^{M^2 \times d}$. The attention computed by the self-attention mechanism within the local window is:

$$\mathrm{Attention}(Q, K, V) = \mathrm{SoftMax}\!\left(\frac{QK^T}{\sqrt{d}}\right)V \quad \text{(Equation 6)}$$

where $d$ is the feature dimension of each head in multi-head attention. We denote this as W-MSA, the window multi-head self-attention mechanism. The specific steps are illustrated in Figure 4: a convolution layer first increases the channel count while halving the spatial dimensions, and the subsequent local window attention and global attention can be repeated $N$ times to enhance feature extraction.
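A sketch of the window partition and W-MSA computation described above, assuming identity projections in place of $P_Q$, $P_K$, $P_V$ for brevity; the padding behavior follows the text.

```python
import torch
import torch.nn.functional as F

def window_self_attention(x, M, num_heads):
    """W-MSA sketch: pad so H and W divide by M, split into M x M windows,
    and run multi-head self-attention inside each window independently.
    x: (N, H, W, C) with C divisible by num_heads.
    """
    n, h, w, c = x.shape
    pad_h, pad_w = (-h) % M, (-w) % M
    x = F.pad(x, (0, 0, 0, pad_w, 0, pad_h))        # pad W then H (C untouched)
    hp, wp = h + pad_h, w + pad_w

    # (N, Hp, Wp, C) -> (N * num_windows, M*M, C)
    x = x.view(n, hp // M, M, wp // M, M, c)
    windows = x.permute(0, 1, 3, 2, 4, 5).reshape(-1, M * M, c)

    d = c // num_heads
    qkv = windows.view(-1, M * M, num_heads, d).transpose(1, 2)  # identity proj.
    attn = torch.softmax(qkv @ qkv.transpose(-2, -1) / d ** 0.5, dim=-1)
    out = (attn @ qkv).transpose(1, 2).reshape(-1, M * M, c)

    # Reverse the window partition and strip the padding.
    out = out.view(n, hp // M, wp // M, M, M, c)
    out = out.permute(0, 1, 3, 2, 4, 5).reshape(n, hp, wp, c)
    return out[:, :h, :w, :]
```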

Coarse and fine matching module
Through the local window aggregation module, we obtain $F^{1/2}$ and $F^{1/8}$. First, we apply cross-attention to the $F^{1/8}$ maps of the image pair, which enhances the effectiveness of coarse-grained feature matching. Then, on the processed $F^{1/8}$, we calculate the score matrix $S$ between the transformed features as $S(i, j) = \frac{1}{\tau}\langle F_A(i), F_B(j)\rangle$, where $\tau$ is a temperature coefficient. Subsequently, softmax is applied along both dimensions of $S$ (double softmax) to obtain the probability of soft mutual-nearest-neighbor matching:

$$P_c(i, j) = \mathrm{softmax}(S(i, \cdot))_j \cdot \mathrm{softmax}(S(\cdot, j))_i \quad \text{(Equation 7)}$$

Based on the confidence matrix $P_c$, matches with confidence above the threshold $\theta_c$ are selected, and the mutual nearest neighbor (MNN) criterion is further applied to filter potential outlier coarse matches. The coarse match prediction is:

$$M_c = \big\{(\tilde{i}, \tilde{j}) \;\big|\; (\tilde{i}, \tilde{j}) \in \mathrm{MNN}(P_c),\ P_c(\tilde{i}, \tilde{j}) \geq \theta_c\big\} \quad \text{(Equation 8)}$$

After establishing coarse-grained matches, we merge the upsampled, cross-attention-processed $F^{1/8}$ with $F^{1/2}$ to obtain a new $F^{1/2}$.
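The double softmax of Equation 7 and the MNN-plus-threshold filtering of Equation 8 can be sketched as follows; the default temperature value is an assumption.

```python
import torch

def coarse_match(feat_a, feat_b, temperature=0.1, theta_c=0.2):
    """Double-softmax matching (Equation 7) plus mutual-nearest-neighbor
    filtering (Equation 8). feat_a: (La, D), feat_b: (Lb, D).
    """
    s = feat_a @ feat_b.t() / temperature                   # score matrix S
    p = torch.softmax(s, dim=1) * torch.softmax(s, dim=0)   # P_c

    # Keep (i, j) iff it is the row-wise AND column-wise maximum of P_c
    # and its confidence clears the threshold theta_c.
    mask = (p == p.max(dim=1, keepdim=True).values) \
         & (p == p.max(dim=0, keepdim=True).values) \
         & (p >= theta_c)
    return mask.nonzero(as_tuple=False), p                  # indices, confidences
```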
Then, we perform another round of cross-attention on the new $F^{1/2}$ to further enhance fine-grained feature matching. We use a correlation-based approach to refine the coarse-grained matches to the original image resolution. For each coarse-grained match $(\tilde{i}, \tilde{j})$, we first locate its position $(\hat{i}, \hat{j})$ on $F^{1/2}$. We then crop two local windows of size $w \times w$ centered at $\hat{i}$ and $\hat{j}$, generating two transformed local feature maps $F_A^{1/2}(\hat{i})$ and $F_B^{1/2}(\hat{j})$. We correlate the center vector of $F_A^{1/2}(\hat{i})$ with all vectors of $F_B^{1/2}(\hat{j})$ to obtain a heatmap representing the probability that each pixel in the neighborhood of $\hat{j}$ matches $\hat{i}$. By computing the expectation over this probability distribution, we obtain the final position $\hat{j}'$ on $I_B$ with sub-pixel accuracy. Collecting all matches $\{(\hat{i}, \hat{j}')\}$ yields the final fine-grained matching $M_f$.
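A minimal sketch of the sub-pixel refinement for one coarse match, assuming the $w \times w$ window of image B has already been cropped and flattened:

```python
import torch

def refine_match(center_vec, window_feat, w):
    """Correlate the center descriptor of image A's window with every
    descriptor in image B's w x w window, then take the 2D spatial
    expectation (soft-argmax) of the resulting heatmap.

    center_vec: (D,); window_feat: (w*w, D). Returns a sub-pixel offset
    relative to the window center.
    """
    heat = torch.softmax(window_feat @ center_vec, dim=0).view(w, w)
    coords = torch.arange(w, dtype=heat.dtype) - w // 2   # e.g. [-2..2] for w=5
    exp_x = (heat.sum(dim=0) * coords).sum()              # expectation over columns
    exp_y = (heat.sum(dim=1) * coords).sum()              # expectation over rows
    return torch.stack([exp_x, exp_y])
```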

Loss function
The final loss is the sum of a coarse-grained loss and a fine-grained loss: $L = L_c + L_f$. We tested different loss weightings, such as $L = L_c + 2L_f$, $L = 2L_c + L_f$, and $L = 0.5L_c + L_f$, but the final training performance was slightly weaker than with $L = L_c + L_f$. Therefore, in our final model, the coarse-grained and fine-grained losses are weighted equally.
The coarse-grained loss is the negative log-likelihood loss over the confidence matrix $P_c$ returned by the double softmax operation. The ground-truth labels for the confidence matrix during training are computed from camera poses and depth maps. We define the true coarse-grained matches $M_c^{gt}$ as the mutual nearest neighbors of the two sets of 1/8-resolution grids, where the distance between two grid cells is measured by the reprojection distance of their center positions. We minimize the negative log-likelihood of the matches in $M_c^{gt}$:

$$L_c = -\frac{1}{|M_c^{gt}|} \sum_{(\tilde{i}, \tilde{j}) \in M_c^{gt}} \log P_c(\tilde{i}, \tilde{j}) \quad \text{(Equation 9)}$$

For fine-grained refinement at the pixel level, we use an L2 loss. For each point $\hat{i}$ to be matched, we measure its uncertainty by the total variance $\sigma^2(\hat{i})$ of the corresponding heatmap. The goal is to optimize fine-grained match positions that have lower uncertainty, giving the weighted loss:

$$L_f = \frac{1}{|M_f|} \sum_{(\hat{i}, \hat{j}') \in M_f} \frac{1}{\sigma^2(\hat{i})} \left\| \hat{j}' - \hat{j}'_{gt} \right\|_2^2 \quad \text{(Equation 10)}$$

Here, $\hat{j}'_{gt}$ is computed by mapping each $\hat{i}$ from $F_A^{1/2}(\hat{i})$ to $F_B^{1/2}(\hat{j})$ using the ground-truth camera poses and depth information. When calculating $L_f$, if the mapped position of $\hat{i}$ falls outside the local window of $F_B^{1/2}(\hat{j})$, we ignore the pair $(\hat{i}, \hat{j}')$. During training, gradients are not propagated back through $\sigma^2(\hat{i})$.
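A compact sketch of the combined loss under the formulation above; tensor layouts are our assumptions.

```python
import torch

def total_loss(p_c, gt_coarse, pred_fine, gt_fine, var):
    """L = L_c + L_f with equal weights, as in the final model.

    p_c:       (La, Lb) confidence matrix.
    gt_coarse: (G, 2) index pairs in M_c^gt.
    pred_fine / gt_fine: (K, 2) refined and ground-truth positions.
    var:       (K,) heatmap total variance sigma^2 (detached: no gradient).
    """
    l_c = -torch.log(p_c[gt_coarse[:, 0], gt_coarse[:, 1]]).mean()
    l_f = (((pred_fine - gt_fine) ** 2).sum(dim=-1) / var.detach()).mean()
    return l_c + l_f
```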

Experiments
We used the ScanNet 44 dataset for indoor training and MegaDepth 45 for outdoor training. Drawing on the training setups of LoFTR 11 and SuperGlue, 15 we trained our model by sampling image pairs with overlap scores between 0.4 and 0.8. The numbers of aggregation blocks in the four stages of our model were set to (3, 3, 6, 9), respectively. Coarse-grained and fine-grained cross-attention were each applied four times. We set the threshold $\theta_c$ to 0.2, and the window sizes for the local aggregation module and fine-grained matching were both set to 5 × 5.

Pose estimation
Camera pose estimation has a wide range of applications. In augmented reality, precise analysis of the camera's position and orientation in space allows virtual objects to be accurately superimposed onto real-world scenes, enhancing the user's perceptual experience. In minimally invasive surgery, camera pose estimation is used to track the position and orientation of surgical instruments in real time, assisting doctors in performing precise operations and improving the safety and efficacy of surgeries. In intelligent security systems, it is employed for the automatic calibration and adjustment of surveillance cameras, enhancing the intelligent management of public safety. Our method achieves more accurate camera pose estimation, which in turn helps downstream applications such as these operate more quickly and accurately.
Indoor image matching is challenging due to lack of texture, high self-similarity, and complex three-dimensional geometric structures. To demonstrate the effectiveness of our method for pose estimation in indoor scenes, we selected the ScanNet 44 dataset and our dataset of weakly textured indoor wall images for experimentation. Table 1 compiles the area under the curve (AUC) of the pose error and the precision for various methods; our method outperforms the previous best, showing significant improvements in AUC@20° and precision, with a 1.14% increase in AUC@20° in particular. Outdoor pose estimation experiments were conducted on the MegaDepth 45 test set and our weakly textured outdoor wall image dataset. Table 2 similarly compiles the AUC of the pose error and the precision for various methods; our method shows improvements in AUC@5°, AUC@10°, and precision, notably achieving a 1.8% increase in precision. Figure 5 shows the feature-matching performance of our method compared with others in indoor and outdoor scenes, with our method achieving a higher number of correspondences.

Homography estimation
Homography estimation has significant applications in image processing and computer vision, especially in image registration and perspective transformation. In augmented reality, it can be used to accurately overlay virtual objects onto real-world images, achieving realistic augmented-reality effects. In robotic navigation, it enables robots to plan paths and avoid obstacles in complex environments, improving the accuracy and safety of navigation. In medical image processing, it is employed to register images from different perspectives, providing more comprehensive diagnostic information. Our method achieves higher accuracy in homography estimation, which can enhance downstream tasks, offering richer information and stronger presentation effects. We conducted unified homography estimation tests on the HPatches 51 dataset and our weakly textured indoor wall image dataset. In the tests, a reference image was paired with five other photos; feature matching was performed for each image pair, and the homography was estimated with OpenCV using the RANSAC method to enhance robustness. Table 3 compiles the AUC of the corner error, precision, and recall at different thresholds (3, 5, and 10 pixels) for various methods. Our method outperformed the others in AUC@3px, AUC@10px, precision, and recall, with the most pronounced improvement in AUC@10px, which increased by 2.5%. Figure 6 shows the matching performance of our method compared with others on indoor and outdoor weakly textured walls; our method achieves a higher number of correspondences and detects more weak-texture structure, for example matching features on trees outdoors.
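The OpenCV step can be reproduced as below; `mkpts0` and `mkpts1` stand in for the matched coordinates produced by the matcher, and the reprojection threshold is an illustrative choice.

```python
import cv2
import numpy as np

# Placeholder matched coordinates, (K, 2) each; in practice these come
# from the feature matcher.
mkpts0 = (np.random.rand(100, 2) * 480).astype(np.float32)
mkpts1 = mkpts0 + np.random.randn(100, 2).astype(np.float32)

# RANSAC rejects outlier matches while fitting the 3x3 homography H.
H, inlier_mask = cv2.findHomography(mkpts0, mkpts1, cv2.RANSAC,
                                    ransacReprojThreshold=3.0)
```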

Visual localization
Visual localization plays a critical role in several fields.In autonomous driving, visual localization helps vehicles accurately determine their position on the road, enhancing driving safety and reliability.In robotic navigation, visual localization provides precise positional data in unknown environments, aiding robots in autonomous navigation.In virtual reality applications, visual localization helps systems accurately track the user's position and posture, providing an immersive experience.By using our method, downstream tasks can achieve more precise localization, thereby improving the safety and reliability of applications.
We conducted visual localization experiments, evaluating our method on the long-term visual localization benchmark, 52 which benchmarks visual localization under varied conditions, such as day-night changes, scene geometry changes, and indoor scenes with large texture-less regions. The evaluation used the HLoc 53 pipeline on the InLoc 4 dataset. Table 4 compiles the localization accuracy within (0.25 m, 0.50 m, 1.0 m) translation and 10° rotation thresholds under DUC1 and DUC2. Our method showed significant improvements, especially under DUC2, where accuracy increased by an average of 1.65%.

Robustness experiments
To evaluate the robustness of our method under changes in illumination and viewpoint, we conducted feature-matching tests on HPatches 51 and our weakly textured indoor wall dataset. We calculated the mean matching accuracy (MMA) and the number of correspondences at thresholds from 1 to 10 pixels. 21 Table 5 and Table 6 compile the number of correspondences for the different methods on HPatches 51 and our weakly textured indoor wall dataset; our method performed best, yielding the highest number of correspondences, an average improvement of 25.53% over the previous best. Figure 7 shows the matching performance of our method compared with others on indoor and outdoor weakly textured walls under different lighting conditions, with our method detecting a higher number of correspondences, including more weak-texture correspondences. Figure 8 compiles the average matching accuracy of the different methods under various lighting conditions and viewpoint changes on HPatches 51 and our weakly textured indoor wall dataset. Under illumination changes, our method had the best matching accuracy from 1 to 10 pixels. Under viewpoint changes, our method had higher matching accuracy than the other end-to-end methods, but slightly lower than the detector-based SuperPoint+IMP. Considering lighting and viewpoint changes together, our method exhibited the highest matching accuracy below a 5-pixel threshold; in the 6 to 10-pixel range, it was higher than the other end-to-end methods and slightly lower than the detector-based SuperPoint+IMP. These results demonstrate the strong robustness of our method.
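For reference, MMA at thresholds of 1 to 10 pixels can be computed as in the following sketch, assuming a ground-truth homography `H_gt` such as those provided by HPatches.

```python
import numpy as np

def mean_matching_accuracy(pts_a, pts_b, H_gt, max_px=10):
    """Fraction of matches whose reprojection error under the ground-truth
    homography H_gt falls below each threshold 1..max_px.
    pts_a, pts_b: (K, 2) matched coordinates; H_gt: (3, 3).
    """
    ones = np.ones((len(pts_a), 1))
    proj = np.hstack([pts_a, ones]) @ H_gt.T
    proj = proj[:, :2] / proj[:, 2:3]                 # dehomogenize
    err = np.linalg.norm(proj - pts_b, axis=1)
    return [(err <= t).mean() for t in range(1, max_px + 1)]
```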

Weak texture region matching
We validated the effectiveness of our method for matching weak-texture regions on the ScanNet 44 weak-texture dataset. First, we captured the local texture features of each image using the gray-level co-occurrence matrix (GLCM). Then, we distinguished high- and low-texture regions by analyzing contrast and homogeneity. The comparison results are shown in Figure 9. In the contrast map, low-texture areas are represented in cool colors and high-texture regions in warm colors; conversely, in the homogeneity map, low-texture areas are depicted in warm colors and high-texture regions in cool colors. Figure 9 shows that most of the image consists of low-texture regions, yet our method effectively performs feature matching in these areas, producing many matched feature points. Our method therefore proves effective for matching weakly textured images.
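A sketch of the GLCM analysis using scikit-image (`graycomatrix`/`graycoprops`); the patch size and per-patch evaluation are our assumptions about the procedure.

```python
import numpy as np
from skimage.feature import graycomatrix, graycoprops

def texture_maps(gray, patch=16):
    """Per-patch GLCM contrast and homogeneity maps, a sketch of the
    analysis used to separate high- and low-texture regions.
    gray: uint8 grayscale image.
    """
    h, w = gray.shape
    contrast = np.zeros((h // patch, w // patch))
    homogeneity = np.zeros_like(contrast)
    for i in range(h // patch):
        for j in range(w // patch):
            p = gray[i * patch:(i + 1) * patch, j * patch:(j + 1) * patch]
            glcm = graycomatrix(p, distances=[1], angles=[0],
                                levels=256, symmetric=True, normed=True)
            contrast[i, j] = graycoprops(glcm, "contrast")[0, 0]
            homogeneity[i, j] = graycoprops(glcm, "homogeneity")[0, 0]
    return contrast, homogeneity
```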

Running time
The original Transformer has a high computational cost, so even though our method uses a modified Transformer to accelerate computation, it still requires slightly more computational resources. We calculated the number of parameters and GFLOPs for LoFTR, 11 DKM, 48 MatchFormer, 21 MR-Matcher, 50 OAMatcher, 49 and our method, as shown in Table 7. Our model is neither the lightest nor the heaviest of these; overall, its cost is acceptable. We are also researching ways to further reduce the computational load of the Transformer.

Ablation study
The experiments in this subsection were all conducted on the ScanNet 44 dataset. To verify the rationality of the model design, we performed ablation experiments. We compared the impact of different local window sizes on the model; the results are shown in Table 8. As the local window size increases, accuracy improves only marginally while the number of matched feature points drops sharply. Therefore, to obtain more matched feature points, we ultimately set the local window size to 5 × 5. We also compared the impact of different modules on the model. W-MSA denotes local window attention, L-MSA denotes global (linear) attention, coarse-cross denotes cross-attention in the coarse-grained matching stage, and fine-cross denotes cross-attention in the fine-grained matching stage, with the number in parentheses indicating how many cross-attention layers are applied. As shown in Table 9, we tested different module combinations; the results without cross-attention in the matching stage were the worst. We also varied the number of cross-attention layers in the matching stage: simply increasing their number does not improve the results and only adds unnecessary parameters.

The differences between our local window approach and ASpanFormer
Our method differs from the local attention mechanism in ASpanFormer 16 in several ways. First, ASpanFormer's 16 local attention is applied in cross-attention, while our local attention is used in self-attention. We believe that in feature matching, local attention should not be used in cross-attention, because cross-attention compares the features of two images for matching; applying local attention at this stage prevents the two images from perceiving each other's global context. For instance, suppose region A in image 1 corresponds to region B in image 2; we do not know which regions correspond before performing cross-attention, so local cross-attention might end up processing region A of image 1 against region D of image 2, producing poor matches. We compared the feature-matching results of ASpanFormer 16 and our method on ScanNet data. As shown in Figure 10, the feature points matched by our method are evenly distributed, while those matched by ASpanFormer 16 are biased toward the upper half of the image; our method also matches more feature points. (The image sizes of the two methods' results differ because ASpanFormer's 16 code resizes images to 352 × 512 while we resize them to 640 × 480.) Second, we apply local attention at a different position: during feature extraction, whereas ASpanFormer 16 applies it during feature matching, after the features have been extracted. Third, our window size is fixed, while ASpanFormer's 16 window size varies with the uncertainty of the corresponding coordinates.

Conclusion
This study proposes a feature-matching method based on local window aggregation. Considering that a purely global Transformer may lead to inconsistent matching results, we designed a local window aggregation module that interleaves local window attention with global attention. This ensures local consistency without interference from irrelevant regions, enabling the detection of more correspondences. We also designed a coarse-to-fine matching module to obtain more accurate matches: coarse matches are obtained through cross-attention applied to coarse-grained features, which are then fused with fine-grained features using cross-attention, and refinement is performed at the fine-grained level using the coarse matching results to achieve sub-pixel matching accuracy. Under the same training conditions, our method improves AUC@20° in indoor pose estimation by 1.14%, precision in outdoor pose estimation by 1.8%, AUC@10px in homography estimation by 2.5%, and accuracy on DUC2 in visual localization by 1.65%, achieving the best results. In matching weak-texture areas, our method produces the highest number of matches, an increase of 23.53%. Under illumination changes, our method demonstrated the best matching accuracy from 1 to 10 pixels. Under viewpoint changes, our method exhibited higher matching accuracy than other end-to-end methods, though its accuracy relative to the detector-based method is comparatively low. This may be because our method was not trained on a dataset with exaggerated viewpoint changes, which hindered the model from extracting useful features and led to weaker performance during matching. Our future research will therefore also concentrate on feature matching under extreme viewpoint changes.

Lead contact
Further information and requests for resources and reagents should be directed to and will be fulfilled by the lead contact, Wenpeng Li (leewenpeng@126.com).

Materials availability
This study did not generate new unique reagents.

STAR+METHODS

KEY RESOURCES TABLE

EXPERIMENTAL MODEL AND STUDY PARTICIPANT DETAILS
The core objective of feature matching is establishing correspondences between two images. While current detector-free feature-matching methods have achieved impressive results, they often prioritize global features, neglecting regions with subtle texture variations, which results in fewer matching points, especially in regions with weak textures. This paper proposes a feature-matching method based on local windows aggregation, which considers global features while focusing more on texture variations within local windows to achieve more accurate matching points and more accurate correspondences in weak-texture regions. Our method first employs a local window aggregation module to reduce irrelevant interference by applying window attention, followed by global attention, and generates coarse-grained and fine-grained feature maps. These feature maps are then processed by a matching module, and coarse-grained matches are obtained using the nearest-neighbor principle. Finally, after fusing the coarse-grained and fine-grained feature maps, the coarse-grained matching results are refined on the fine-grained feature maps using local window refinement to obtain the final matches. Experimental results demonstrate that our method outperforms state-of-the-art methods in pose estimation, homography estimation, and visual localization under the same training conditions.

METHOD DETAILS Processing procedure
Our method consists of three main modules: the local windows aggregation module, the coarse matching module, and the fine matching module. Below, we briefly introduce the entire process. Given images $I_A$ and $I_B$, we first extract multi-scale feature maps for each image using the local windows aggregation module. We denote the feature map at scale $1/i$ as $F^{1/i} = \{F_A^{1/i}, F_B^{1/i}\}$. Next, we input $F^{1/8}$ into the coarse matching module for coarse-grained feature matching. We use the nearest-neighbor principle to obtain a confidence matrix $P_c$ and predict coarse-grained matches $M_c$ based on a confidence threshold. Finally, we input $F^{1/2}$, $F^{1/8}$, and the coarse-grained matches $M_c$ into the fine matching module. We upsample $F^{1/8}$ and fuse it with $F^{1/2}$ before performing fine-grained matching. We crop local windows from $F^{1/2}$ and compute the spatial expectation coordinates of the two-dimensional heatmap for each local window to obtain the final matching results $M_f$.
The local windows aggregation (LWA) module takes the input image $I \in \mathbb{R}^{2 \times H \times W \times 1}$ (where 2, $H$, $W$, and 1 denote the number of images, height, width, and number of channels) and applies four rounds of local window aggregation followed by feature pyramid network (FPN) processing to obtain $F^{1/2}$ and $F^{1/8}$. $C_0$, $C_1$, $C_2$, and $C_3$ denote the channel dimensions of the feature maps. Initially, the image is grayscale with a single channel. The first local window aggregation changes the feature dimension from 1 to $C_0$ and halves the height and width; the second changes it from $C_0$ to $C_1$, halving the spatial size again; the third changes it from $C_1$ to $C_2$; and the fourth changes it from $C_2$ to $C_3$, each halving the height and width once more. Finally, the four feature maps are fused through a feature pyramid, outputting two feature maps for subsequent coarse-grained and fine-grained feature matching. The dimensions of these two feature maps are $C_0$ and $C_2$, respectively. We denote the processing at stage $i$ as $\mathrm{LWA}_i(\cdot)$ and the FPN as $\mathrm{FPN}(\cdot)$. The local window aggregation processing is represented as:

$$I_i = \mathrm{LWA}_i(I_{i-1}), \quad i = 1, 2, 3, 4 \quad \text{(Equation 11)}$$

$$o = \mathrm{FPN}\big(\{I_i\}_{i=1}^{4}\big) \quad \text{(Equation 12)}$$

Finally, we obtain feature maps at 1/8 and 1/2 of the original image size. In local window aggregation, the attention mechanism plays a central role. Ordinary attention takes three inputs: $Q$ (query), $K$ (key), and $V$ (value). The attention output is a weighted sum, where the weight matrix is determined by $Q$ and its corresponding $K$. This process can be described as:

$$\mathrm{Attention}(Q, K, V) = \mathrm{SoftMax}(QK^T)V \quad \text{(Equation 13)}$$

However, in visual tasks the size of the weight matrix $\mathrm{SoftMax}(QK^T)$ grows quadratically with image resolution, so the memory and computational costs of ordinary attention become prohibitive at high resolutions. To address this issue, linear attention has been proposed, 22 which replaces the softmax operation with the product of two kernel functions:

$$\mathrm{Attention}(Q, K, V) = \phi(Q)\big(\phi(K)^T V\big) \quad \text{(Equation 14)}$$

Here, $\phi(\cdot) = \mathrm{elu}(\cdot) + 1$. Since the number of feature channels is much smaller than the number of pixels, the computational complexity drops from quadratic to linear. We therefore adopt linear attention for global attention.
The local attention mechanism is inspired by the Swin Transformer, 40 but unlike it, we do not use the shifted-window operation, replacing it instead with global attention. Additionally, we do not require the image height and width to be exact multiples of the window size; if they are not divisible, we pad the image accordingly. For an input of size $H \times W \times C$, we first reshape it into a feature map of size $\frac{HW}{M^2} \times M^2 \times C$, dividing it into non-overlapping $M \times M$ local windows, where $\frac{HW}{M^2}$ is the number of windows. Self-attention is then computed separately within each window. For the local window features $X \in \mathbb{R}^{M^2 \times C}$, the projections used to implement local attention are:

$$Q = XP_Q, \quad K = XP_K, \quad V = XP_V \quad \text{(Equation 15)}$$

where $P_Q$, $P_K$, $P_V$ are projection matrices shared across windows, and typically $Q, K, V \in \mathbb{R}^{M^2 \times d}$. The attention computed by the self-attention mechanism within the local window is:

$$\mathrm{Attention}(Q, K, V) = \mathrm{SoftMax}\!\left(\frac{QK^T}{\sqrt{d}}\right)V \quad \text{(Equation 16)}$$

where $d$ is the feature dimension of each head in multi-head attention. We denote this as W-MSA, the window multi-head self-attention mechanism.

Matching steps
The steps for using the model to perform matching are as follows. First, read the two images to be matched, Image A and Image B, and convert them into tensors, tensor_A and tensor_B. Second, construct a dictionary object named data and place tensor_A and tensor_B into the dictionary under the keys "image0" and "image1". Third, pass the data object into the model for prediction. Fourth, the final prediction results are stored back into the data object, where "mkpts0_f" and "mkpts1_f" are the coordinates of the matched features in Image A and Image B, respectively, and "mconf" is the confidence of each match. A sketch of this interface is given below.
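A usage sketch of this dictionary interface; `model` is the trained matcher, loaded elsewhere.

```python
import torch

def run_matching(model, tensor_a, tensor_b):
    """Run the dictionary-based matching interface described above.

    tensor_a, tensor_b: grayscale image tensors shaped (1, 1, H, W);
    `model` is assumed to write its predictions back into `data`.
    """
    data = {"image0": tensor_a, "image1": tensor_b}
    with torch.no_grad():
        model(data)                      # predictions are stored in `data`
    return data["mkpts0_f"], data["mkpts1_f"], data["mconf"]
```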

QUANTIFICATION AND STATISTICAL ANALYSIS
All statistical details and sample sizes are provided.The exact statistical tests and variables used are described in the text and the legends of the tables and figures.

Figure 1 .
Figure 1. Our method can detect more correspondences compared with LoFTR

Figure 2 .
Figure 2. The overall process of our method

Figure 3. Figure 4.
Figure 3. The structure of the local window aggregation module
Figure 4. The processing steps of a local window aggregation block


Figure 5 .
Figure 5. The feature matching performance on indoor and outdoor objects; our method can detect more correspondences

Figure 6. Figure 7.
Figure 6. The feature matching performance on indoor and outdoor walls; our method can detect more weak-texture correspondences
Figure 7. The feature matching performance on indoor and outdoor walls under different lighting conditions; our method can detect more correspondences

Figure 8. Figure 9.
Figure 8. The average matching accuracy under illumination and viewpoint changes, as well as overall, on the HPatches and weakly textured indoor wall dataset
Figure 9. GLCM contrast and homogeneity maps used to distinguish high- and low-texture regions

Table 1 .
Indoor pose estimation on ScanNet

Table 2 .
Outdoor pose estimation on MegaDepth

Table 3 .
Homography estimation on the HPatches and weakly textured indoor wall datasets

Table 4 .
Visual localization using the HLoc method

Table 5 .
The number of correspondences on the HPatches indoor dataset

Table 6 .
The correspondences on the weakly textured indoor wall image dataset

Table 7.
The number of parameters and GFLOPs of different methods

Table 8.
The impact of different local window sizes on the model

Table 9.
The impact of different module combinations on the model